ABSTRACT

As device technology approaches fundamental limits, future increases
in computing power with improved cost/performance ratio will be
forced to rely on advances in computer architecture.

Twine RISC is a novel, low cost, single chip processor architecture
which exploits instruction level temporal parallelism through its
well engineered RISC pipeline and spatial parallelism by allowing
multiple threads of computation to coexist and execute in parallel.
In this project, a simulator for evaluating Twine RISC is developed.
Key issues in the design of the architecture are the generation and
synchronization of threads and support for split phase transactions
for data transfer to and from global shared memory. Technological
constraints such as the available VLSI technology (which decides chip
size) and state of the art memory technology are also considered.





Table of Contents

Chapter 1. Introduction and Thesis Organization
1.1 Introduction
1.2 Thesis Organization

Chapter 2. Background and Related Work
2.1 Introduction and Overview
2.1.2 Dataflow Graphs
2.2 Dataflow Architectures
2.3 Dataflow/von Neumann Hybrid Architectures
2.4 Enhancement and Support Toward Multithreading

Chapter 3. Twine RISC : Its Architecture
3.1 Introduction and Overview
3.2 Various Building Blocks
3.2.1 Code Memory
3.2.2 Operand Memory
3.2.3 Token Queue
3.2.4 Sequencer
3.2.5 Data Queue
3.2.6 Message Processor
3.3 The Twine RISC Stream Pipeline
3.3.1 Instruction Fetch Unit
3.3.2 Operand Fetch Unit
3.3.3 Execution Unit
3.3.4 Result Store Unit
3.3.5 Continuation Token Unit

Chapter 4. Twine RISC : Software Environment
4.1 Introduction and Overview
4.2 Instruction Set and Its Coding
4.3 Handling Multiple Threads
4.3.1 MFORK : Generation of Multiple Threads
4.3.2 MJOIN : Synchronization of Multiple Threads
4.4 Data Transfer To and From Global Memory
4.4.1 LOAD : Move Data From Global Memory to Operand Memory
4.4.2 RESM : Synchronise Data Transfer and Resume
4.4.3 STORE : Move Data From Operand Memory to Global Memory
4.5 Instruction Set Summary

Chapter 5. Simulator and Performance Evaluation
5.1 Introduction and Overview
5.2 Simulator Structure
5.2.1 Input Preparation
5.2.2 Execution
5.3 Performance Metrics
5.4 Some Design Issues
5.5 Summary

Chapter 6. Conclusion and Future Work
6.1 Introduction and Overview
6.2 Philosophy
6.3 Summary and Future Work

Appendix A. Instruction Set
A.1 Instruction Set
A.2 Instruction Execution in TRS Pipeline
A.2.1 Ordinary RISC Like Instructions
A.2.2 Special Instructions
A.3 Instruction Set Coding
A.4 Instruction Set Summary

Appendix B. Code Structure for Simulator

Appendix C. User's Manual and Test Programs
C.1 User's Manual
C.2 Test Programs

References



List of Figures

2.1.1 Dataflow Graph for Expression (a*b) + (c*d)
2.2.1 Static Dataflow Architecture
2.2.2 PE for Tagged Token Dataflow Architecture
2.4.1 State Transition Diagram for an I-structure Cell
3.1 Twine RISC Processor Architecture
4.2.1 Instruction Set
A.1.1 Instruction Set
A.2.1 Twine RISC Processor Architecture
A.3.1 Instruction Set Coding
4.3.2 Instruction Set Summary
C.2.1 Sequential and Parallel Control Flows for Loads
C.2.2 CMF for Prog.1
C.2.3 GMF, TQF, RSF, ROUT for Prog.1
C.2.4 Concurrent Loads and Iterations



Chapter 1. Introduction and Thesis Organization

1.1 Introduction :

Fine grain multicomputers offer the potential of a significant
increase in maximum computing power with greatly improved
cost/performance ratio. Significant challenges exist in processor
architecture for these machines.

RISC processors exploit instruction level parallelism (temporal
parallelism), whereby an instruction pipeline is kept busy and
performs more than one operation at a time for various instructions.
A significant amount of easily detectable parallelism actually
exists in most general purpose codes. Dataflow architectures appear
to be the most suitable for exploiting such parallelism, as they
support generation and coordination of parallel activities directly
in hardware and can tolerate long unpredictable communication delays
[3]. There has been a consistent convergence toward a "practical"
architectural framework for implementing dataflow machines. The
dataflow/von Neumann hybrid architecture is a new phase of evolution
in computer architecture that exploits both temporal and spatial
parallelism [2].

The aim of this thesis work is to simulate a novel processor
architecture called Twine RISC and to enhance it. Twine RISC is a
low cost single chip processor architecture which exploits
instruction level parallelism through its well engineered RISC
pipeline and spatial parallelism by allowing multiple threads of
computation to coexist and execute in parallel.

1.2 Thesis Organization :

The rest of the thesis is organized as follows. In Chapter 2, we
provide background and related work toward a "practical"
architectural framework for dataflow/von Neumann hybrid
architectures. In Chapter 3, the architecture of the Twine RISC
processor is discussed. Chapter 4 deals with the software environment
for the Twine RISC. In Chapter 5, the simulator used to test and
evaluate Twine RISC is discussed. Finally, Chapter 6 is the
concluding chapter of the thesis. It contains a brief summary and a
discussion of future work in this area.

Appendix A gives the complete instruction set and execution flow
policy for Twine RISC. Appendix B is a condensed specification of
the simulator. Appendix C describes the Input/Output specifications
and the User's manual for the simulator. Some test programs and
performance results are also included.




Fig 2.1.1 Dataflow graph for expression (A*B) + (C*D)




Chapter 2. Background and Related Work

2.1 Introduction and Overview :

In this chapter we briefly discuss dataflow graphs and their ability
to represent the maximum available parallelism in a program. There
have been several attempts at building machines capable of executing
dataflow graphs. We discuss some of them in Section 2.2. In Section
2.3, dataflow/von Neumann hybrid architectures are discussed.
Finally, in Section 2.4 we discuss enhancements to the conventional
RISC architecture that provide support for multithreading. These
include an enhanced memory model, split phase transactions and
primitives for multithreading.

Dataflow graphs :

Dataflow graphs are powerful intermediate representations for
compilers. They are directed graphs in which nodes represent
primitive functions such as ADD, SUB, etc., and the arcs represent
data dependencies between functions. Dataflow graphs specify only a
partial order for the execution of instructions and thus provide
opportunities for parallel and pipelined execution at the level of
individual instructions. For example, the dataflow graph for the
expression [a*b + c*d] only specifies that both multiplications be
executed before the addition. However, the multiplications can be
executed in any order or even in parallel. The advantage of this
flexibility becomes apparent when we consider that the order in which
a, b, c, and d will become available may not be known at compile
time.

This strategy implicitly introduces sequencing between instructions
which depend on each other, but allows instructions to execute in
parallel if there exists no dependency between them.

Thus, if dataflow graphs are executed directly, the machine can
exploit the maximum available parallelism in a computation.

2.2 Dataflow architectures :

Dataflow architectures are language based architectures in which
dataflow program graphs are the base language. Here dataflow graphs
constitute a formal interface between dataflow architectures and user
programming languages.

We can view dataflow graphs as a machine language for a parallel
machine, where a node in a dataflow graph represents a machine
instruction. Each instruction contains an opcode and a list of
destination instruction addresses.

An instruction or a node may execute whenever a token (operand data)
is available on each of its input arcs. When it fires (i.e. the
operation is performed on its input tokens), the input tokens are
consumed, a result is computed, and a result token is produced on
each output arc, which may be an input token for another node in the
graph.

This dictates the following basic instruction cycle :

a. Detect when an operation is enabled (when all operand values are
available).

b. Determine the operation to be performed, i.e. fetch the
instruction.

c. Compute results.

d. Generate result tokens.

This is the basic instruction cycle of any dataflow machine; however,
there remains tremendous flexibility in the details of how this cycle
is performed [?].

(Figure: An Activity Template, with Switch(OUT) and Switch(IN).)
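The instruction cycle above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph encoding (node names, a "dests" list of destination node and input port) and the run function are hypothetical and are not part of Twine RISC or of any machine discussed here. The sketch evaluates a*b + c*d and shows that the result does not depend on the order in which tokens arrive.

```python
import operator

# Hypothetical encoding of the graph for a*b + c*d: each node has an
# opcode and a list of (destination node, input port) pairs.
GRAPH = {
    "mul1": {"op": operator.mul, "dests": [("add", 0)]},
    "mul2": {"op": operator.mul, "dests": [("add", 1)]},
    "add":  {"op": operator.add, "dests": []},
}

def run(graph, initial_tokens):
    """Fire nodes until no tokens remain; return results of sink nodes."""
    waiting = {name: {} for name in graph}   # operands that have arrived
    pending = list(initial_tokens)           # tokens: (node, port, value)
    results = {}
    while pending:
        node, port, value = pending.pop()    # arrival order is arbitrary
        waiting[node][port] = value
        if len(waiting[node]) < 2:           # a. not yet enabled
            continue
        spec = graph[node]                   # b. fetch the instruction
        out = spec["op"](waiting[node][0], waiting[node][1])  # c. compute
        waiting[node] = {}                   # input tokens are consumed
        if not spec["dests"]:
            results[node] = out
        for dest, dest_port in spec["dests"]:  # d. generate result tokens
            pending.append((dest, dest_port, out))
    return results

# a=2, b=3, c=4, d=5
tokens = [("mul1", 0, 2), ("mul1", 1, 3), ("mul2", 0, 4), ("mul2", 1, 5)]
print(run(GRAPH, tokens))        # {'add': 26}
print(run(GRAPH, tokens[::-1]))  # same result for another arrival order
```

All nodes in this sketch are dyadic, which is why a node is considered enabled once two operands have arrived.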

There has been a consistent convergence toward a "practical"
architectural framework for implementing dataflow machines. Several
architectures based on the dataflow concept have been proposed, some
of which have been implemented in experimental machines [2].
Examples are :

1. Static Dataflow Machine Projects :

- The MIT Static Dataflow Machine [5]

- The NEC Dataflow Machines NEDIPS and IPP

In these machines, data tokens are assumed to move along the arcs of
the dataflow program graph to the operator nodes. The nodal operation
gets executed when all its operand data are present at the input
arcs. Also, all output arcs of a node must be empty before that node
is enabled. A token moves to the next unit only after that unit has
signalled that it can accept the token. Only one token is allowed to
exist on any arc at a given time. The restriction cannot be enforced
at hardware level, but its effect can be achieved by executing only
graphs that have the property whereby no more than one token can
reside on any arc at any stage of execution.

The basic model of the static dataflow machine architecture is shown
in Fig 2.2.1.



Fig 2.2.2 PE for Tagged Token Dataflow Machine

2. Dynamic (Tagged Token) Dataflow Machine Projects :

- The Manchester Dataflow Machine [17]

- SIGMA 1 at Electrotechnical Laboratory, Japan [18]

- The MIT Tagged Token Machine [5,7]

- Monsoon : an Explicit Token-Store Architecture (MIT) [?,14]

These machines use tagged tokens, so that more than one token can
exist on an arc. The tagging is achieved by attaching a label to each
token which uniquely identifies the context of that token. A node is
identified by a pair : code block and instruction address. Tags have
four parts, viz. invocation ID, iteration ID, code block, and
instruction address. The iteration ID distinguishes between
different iterations of a particular invocation of a loop code block,
while the invocation ID distinguishes between different invocations.
If the graph is cyclic, the tagging allows dynamic unfolding of the
iterative computations and thereby exploits the maximum available
parallelism.
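As a rough illustration of how tags keep tokens from different iterations apart, the following Python sketch pairs operand tokens by tag. The four tag fields are those named above; the MatchingStore class and its arrive method are hypothetical, and the sketch ignores which input port (left or right) a token is destined for.

```python
from collections import namedtuple

# The four tag fields follow the text; the store itself is an
# illustrative assumption, not a description of any real machine.
Tag = namedtuple("Tag", "invocation iteration code_block instr_addr")

class MatchingStore:
    """Pairs up the two operand tokens of a dyadic node by their tag."""

    def __init__(self):
        self._waiting = {}

    def arrive(self, tag, value):
        """Return (first, second) when the partner is present, else None."""
        if tag in self._waiting:
            return (self._waiting.pop(tag), value)
        self._waiting[tag] = value
        return None

store = MatchingStore()
it0 = Tag(invocation=1, iteration=0, code_block="loop", instr_addr=7)
it1 = it0._replace(iteration=1)

print(store.arrive(it0, 10))  # None: first operand of iteration 0 waits
print(store.arrive(it1, 20))  # None: iteration 1 does not match iteration 0
print(store.arrive(it0, 11))  # (10, 11): the iteration 0 node can fire
```

Tokens of iteration 1 can be in flight on the same arc while iteration 0 is still incomplete, which is the dynamic unfolding described above.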

The basic processing element architecture model of the tagged token
machine is shown in Fig 2.2.2.

Some defects of these machines are as follows :

1. A circular pipeline does not work well as a "pipeline" for less
parallel execution. It may occur that only one token is going round
the pipeline cycle, and that PE throughput is less than one per
pipeline circulation time.

2. A simple packet-based architecture cannot exploit registers or a
register file efficiently. As a token is always realized as a packet
and each of the packets enters a PE whenever possible, it is
pointless to reserve tokens in registers for a future node
operation. This is one of the main reasons why a fine pitch pipeline
is difficult to implement in a dataflow machine.

3. For matching, the hardware and time complexity are heavy.

4. Packet flow traffic is too heavy.

5. It takes much time to eliminate garbage tokens, which are
generated while executing switch operations for conditional
computations.

2.3 Dataflow/von Neumann Hybrid Architectures :

In the previous section we reported several shortcomings of pure
dataflow machines. It has been realized that, to overcome these
shortcomings, the following changes in the design are necessary
[11,13,16].

1. Improve machine performance by integrating a packet based
circular pipeline of dataflow machines and a register based advanced
control pipeline of von Neumann machines.

2. Use a RISC based single chip PE design to simplify the
architecture, and a direct matching scheme with a large register
file.

These lead to the development of dataflow/von Neumann hybrid
architectures, which can exploit both conventional von Neumann and
dataflow compiling technology.

Examples of architectures falling in this class are P-RISC (MIT) [13]
and EMC-R (Electrotechnical Laboratory, Japan) [16].


P-RISC :

P-RISC (for Parallel RISC) can be viewed as a dataflow machine that
can achieve software compatibility with a conventional von Neumann
machine. Distinctive features are RISC like 3-address instructions
that operate entirely within a processing element, LOAD/STORE
instructions to move data in and out of the PE, and an I-structure
type storage model. The collection of frames on a PE is regarded as
a collection of register sets, a particular register set being
identified by a frame pointer. FORK and JOIN are instructions for
thread initiation and synchronization. The processor pipe and the
token queue form a ring around which tokens are circulated [13].

EMC-R :

EMC-R is a PE for the parallel computer EM-4 built at
Electrotechnical Laboratory, Japan. Distinctive features of the EMC-R
architecture are a strongly connected arc dataflow model, a direct
matching scheme and register based sequencing, a RISC based design,
and an integration of a packet based circular pipeline and a
register based advanced control pipeline [16].


2.4 Enhancement and Support towards Multithreading :

Two fundamental issues in multiprocessing/multithreading are memory
latency and waits for synchronization events. Both are very expensive
on von Neumann machines [4][6].



Memory latency :

Memory latency is defined as the time which elapses between making a
request and receiving the associated response from memory. In a von
Neumann processor, memory latency determines the time to execute a
memory reference instruction, which finally determines the maximum
instruction processing speed. Most von Neumann processors are likely
to be "idle" during long memory references, and such references are
unavoidable. In order to reduce memory latency cost, it is essential
that a processor be capable of issuing multiple overlapped memory
requests [7].

A different memory model, I-structure, is used in a few dataflow
machines to tolerate memory latency, thereby increasing overall
throughput. The transactions from processor to I-structure memory are
in terms of messages and are termed split-phase transactions.

I-structure Memory :

The basic idea behind I-structure storage is to defer a data-read if
the corresponding location has not been written. Here each storage
cell contains status bits to indicate that the cell is in one of
three possible states.

a. EMPTY : Nothing has been written into the cell since it was last
allocated. No attempt has been made to read the cell. It may be
written as for conventional memory.

b. FULL : The cell contains valid data that can be freely read as in
a conventional memory. In a conventional I-structure memory, any
attempt to write a FULL cell is signalled as an error. However, in
our model a FULL cell can be overwritten.

c. DEFERRED : Nothing has been written into the cell, but at least
one attempt has been made to read it. When it is written, all
deferred reads must be satisfied.

Cells change state in the obvious ways when presented with requests.

In fact, reads and writes may even get out of order in communication
networks. Synchronization of reads and writes is performed for each
cell of the structure, so that there is no problem when a read
precedes the write to a cell. In such a case, a deferred read state
is created whereby the read is put aside on a list with a promise to
fulfill the read request when the cell is written. This
synchronization is implemented using two bits to indicate whether a
cell is in the empty, full or deferred state. A pointer to the
deferred read list, if one exists, is kept in the empty cell of the
structure until the point when the expected write takes place
(Fig 2.4.1).
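The three states and their transitions can be captured in a small Python sketch. The class and method names are hypothetical; the behaviour follows the description above, including the point that a FULL cell may be overwritten in this model. Deferred readers are represented simply by an identifier.

```python
class IStructureCell:
    """One I-structure cell with the states EMPTY, FULL and DEFERRED."""

    def __init__(self):
        self.state = "EMPTY"
        self.value = None
        self.deferred = []          # readers waiting for the write

    def read(self, reader):
        """Return the value if FULL; otherwise defer the read, return None."""
        if self.state == "FULL":
            return self.value
        self.deferred.append(reader)
        self.state = "DEFERRED"
        return None

    def write(self, value):
        """Store the value and return the (reader, value) replies now owed."""
        replies = [(reader, value) for reader in self.deferred]
        self.deferred = []
        self.value = value
        self.state = "FULL"         # overwriting a FULL cell is allowed here
        return replies

cell = IStructureCell()
print(cell.read("thread-1"))   # None: the read precedes the write
print(cell.state)              # DEFERRED
print(cell.write(42))          # [('thread-1', 42)]: deferred read satisfied
print(cell.read("thread-2"))   # 42
```

Returning the list of owed replies from write mirrors the requirement that all deferred reads be satisfied at the moment the cell is written.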


Split Phase Transactions :

It is a method by which synchrony between the I-structure request
and reply is maintained. A request token is sent to the I-structure
unit (potentially across the communication network) and the
processor is then free to continue executing other instructions
while the request is being delivered and handled. This will never
cause the processor pipeline to stall.

Threads :

Threads are defined as small processes that operate almost entirely
on local data and rarely interact. They are the basic blocks/units
of computation into which programs are decomposed for parallel
execution. Threads can be created dynamically during computation and
die after having produced and consumed data. Threads can be in one
of three states : ready to execute (queued locally or globally),
executing, or suspended (waiting for a synchronization signal).

The following modifications are required to a conventional RISC to
make it suitable for exploiting the fine grain parallelism of
dataflow execution, while still retaining the efficient control
mechanism of von Neumann computing [15,16].

1. Modify the RISC processor implementation to make it
multithreading.

- Implement more than one PE on a single chip.

- Include primitives MFORK and MJOIN for generating and
synchronizing multiple threads of computation.

2. Augment the RISC processor with I-structure like storage with
split phase transactions.

The Twine RISC architecture implements the following features.

1. Loads (memory requests) are split phase transactions. Therefore,
responses can come back in any order.

2. The processor switches automatically to another thread of
computation, if one exists, rather than being idle.

3. The processor supports multiple threads.

4. The pipeline is kept full as long as the token queue is not
empty.

Simultaneous execution of various threads is possible and is carried
out within the processor.




Fig 3.1 Twine RISC Processor Architecture













Chapter 3. Twine RISC : Its Architecture

3.1 Introduction and Overview

In this chapter we discuss the complete processor architecture of
the Twine RISC. We adopt a RISC architecture for the Twine RISC for
its simplicity and execution efficiency. Instruction level
parallelism is exploited in the context of a sequential thread
executing in a well engineered RISC pipeline. Multithreading is
exploited by providing more than one stream of execution pipeline on
a single chip. These streams are called Twine RISC Streams (TRS). In
Section 3.2 we discuss the various blocks of Twine RISC, viz. Code
Memory, Operand Memory, Token Queue, Sequencer, Data Queue, and
Message Processor. The Twine RISC processor also supports split
phase transactions between memory and processor through the Message
Processor and Data Queue. We also discuss this mechanism in Section
3.2. The instruction pipeline in a TRS is discussed in Section 3.3.
The stages in this pipeline are the Instruction Fetch Unit, Operand
Fetch Unit, Execution Unit, Result Store Unit and Continuation Token
generation Unit. (sec

3.2 Various Building Blocks

3.2.1 Code Memory (CM) :

Code memory holds instructions. Each TRS has access to a CM outside
the chip. These CMs are read only memories for the TRSs. A separate
host processor is used to initialize the CM by loading a Twine RISC
program chunk.

3.2.2 Operand Memory (OM) :

It is a register file of 64 registers, each 32 bits wide. Operand
memory is shared by all TRSs. All TRSs can simultaneously write to
this OM at different locations and read operands from it
simultaneously. As is clear from the instruction pipeline, each TRS
requires 2 reads and 1 write per cycle. As the Twine RISC processor
can have more than one TRS to support spatial parallelism, the
Operand Memory has 2N read ports and N write ports for N TRSs.

3.2.3 Token Queue (TQ) :

The Token Queue feeds TRSs with continuation tokens. A continuation
token is formed from two pointers, viz. a frame pointer (FP) and an
instruction pointer (IP). IP indicates the location in the Code
Memory of the instruction to be executed, while FP is a base pointer
to the set of operands in data memory (OM), analogous to the base
address of an activation frame for a procedure invocation. By using
frame relative addressing, the same code block can have multiple
active invocations. Since the continuation tokens generated by any
of the TRSs correspond only to the start addresses of different
threads, they can be picked up by any other TRS in the Twine RISC
processor.
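A minimal Python sketch of continuation tokens and frame relative addressing follows. The 64-entry operand memory matches the text; the Token tuple, the operand_address helper, and in particular the wrap-around of addresses are assumptions made only for illustration. Two tokens carry the same IP (the same code block) but different frame pointers, so each invocation works on its own operands.

```python
from collections import deque, namedtuple

Token = namedtuple("Token", "fp ip")   # continuation token <FP.IP>

OM_SIZE = 64                           # 64 registers, as in the text
operand_memory = [0] * OM_SIZE

def operand_address(token, offset):
    """Frame relative operand address (the wrap-around is an assumption)."""
    return (token.fp + offset) % OM_SIZE

token_queue = deque()
# Two active invocations of the same code block, each with its own frame.
token_queue.append(Token(fp=8, ip=20))
token_queue.append(Token(fp=16, ip=20))

while token_queue:
    tok = token_queue.popleft()        # any TRS may pick up any token
    # Stand-in for "execute the instruction at tok.ip": write to slot 1
    # of the frame that this invocation owns.
    operand_memory[operand_address(tok, 1)] = tok.fp * 10

print(operand_memory[9], operand_memory[17])   # 80 160
```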

3.2.4 Sequencer :

The continuation tokens generated in the system are stored in the TQ
through a Sequencer. The Sequencer samples continuation tokens
generated by the various TRSs and stores them in the TQ. As the
tokens generated by the TRSs are independent of each other, the
sequence in which they are stored in the TQ is irrelevant and a
program works independently of any sequencing scheme imposed by the
system. This makes the design of the Sequencer relatively simple.

3.2.5 Data Queue (DQ) :

It is an alternate operand memory for the special instruction RESM.
RESM does not refer to OM; instead it reads data from DQ and treats
them as operands. DQ is essential hardware which enables OM to be
loaded. When a memory operation LOAD/LOADX is issued, upon
completion of the operation a message is returned to the Message
Processor by the external memory controller. This message contains a
value, a continuation token and a destination register. These data
are written in the DQ. When the RESM instruction is executed, the
data is finally moved from DQ to OM and the thread reinitiates.

3.2.6 Message Processor (MP) :

MP handles message traffic between the processor and external
memory. It also implements split phase transactions, where the
requests for read/write to global I-structure memory are dispatched
from all TRSs through MP. The MP receives read/write requests from
the various TRSs in the processor and forwards them to the external
interface for the global I-structure memory controller. In the case
of a read request, the MP eventually receives a message from the
external interface containing a value, an operand memory location
and a continuation token. Upon receiving such a message, the MP
writes the data into DQ and generates a continuation token <FP.0>.
This is essential, as the MP cannot store data in OM. The
corresponding instruction RESM (at location 0 in CM) takes this data
from DQ and stores it in OM, and also generates the continuation
token.

3.3 The TRS Pipeline :

The TRSs in a Twine RISC processor essentially capture spatial
parallelism between different threads of computation. Within a TRS
the various stages are the instruction fetch unit (IFU), operand
fetch unit (OFU), execution unit (EXU), result store unit (RSU) and
continuation token unit (CTU). All these units operate
asynchronously with handshake signals. There is a buffer between two
successive units.

3.3.1 Instruction Fetch Unit (IFU) :

Initially it fetches a new token address from TQ and fetches the
instruction from CM. It determines whether or not the next
instruction is to be fetched from the subsequent location. For
instance, in the case of the various arithmetic/logic instructions,
the next instruction comes from the subsequent location, i.e.
IP <- IP + 1. However, in the case of branch instructions the
location of the next instruction is not determined at the IF stage,
so IFU fetches a continuation token from the TQ and starts another
thread.

The instruction set is organized in such a way that, by looking at
the first bit of the opcode, IFU can determine whether the next
instruction is to be fetched from IP + 1 or a new thread is to be
started.

To prevent a race between MJOIN instructions, IFU also detects the
MJOIN instruction and stalls other TRS pipelines for MJOIN
instructions by setting the MJOIN lock line. This way, the MJOIN
instruction is executed in an atomic and exclusive manner. The
opcode for MJOIN is chosen to be 111111 so that the detection
hardware at the IF Unit is simplified.

Finally IFU prepares a packet
<6 bit opcode, 6 bit R1, 6 bit R2, 6 bit R3>
and sends it to OFU through the buffer between IFU and OFU.
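The packet handed from IFU to OFU holds four 6 bit fields, 24 bits in total. The Python sketch below packs and unpacks such a packet and detects MJOIN by its all-ones opcode. The field widths and the MJOIN opcode come from the text; the placement of the opcode in the most significant bits and the function names are assumptions.

```python
MJOIN_OPCODE = 0b111111   # opcode value given in the text

def pack(opcode, r1, r2, r3):
    """Pack four 6 bit fields into one 24 bit packet (layout assumed)."""
    for field in (opcode, r1, r2, r3):
        if not 0 <= field < 64:
            raise ValueError("each field must fit in 6 bits")
    return (opcode << 18) | (r1 << 12) | (r2 << 6) | r3

def unpack(packet):
    """Return (opcode, r1, r2, r3) from a 24 bit packet."""
    return tuple((packet >> shift) & 0x3F for shift in (18, 12, 6, 0))

def is_mjoin(packet):
    """All six opcode bits set means MJOIN, a single AND in hardware."""
    return unpack(packet)[0] == MJOIN_OPCODE

packet = pack(0b000101, 1, 2, 3)
print(unpack(packet))                         # (5, 1, 2, 3)
print(is_mjoin(pack(MJOIN_OPCODE, 4, 0, 0)))  # True
```

Six bit register fields are consistent with the 64-entry operand memory, since 2^6 = 64.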

3.3.2 Operand Fetch Unit (OFU) :

This unit recognizes instructions partially by decoding 3 bits of
the opcode and decides the number of operands to be fetched from
operand memory. This unit also decodes the RESM instruction, in
which case it fetches operands from DQ. It then generates a packet
<6 bit opcode, 32 bit left operand, 32 bit right operand, 6 bit
destination register>
which is sent to the EXU through its input buffer. Once the fetch is
done, the handshake signal from OFU to IFU causes IFU to resume its
operation.

3.3.3 Execution or Functional Unit (EXU) :

This unit is identical to a conventional ALU except that it
generates continuation tokens for branch and other special
instructions for thread instantiation. It prepares a packet
<32 bit result value, 6 bit destination register>
and forwards it through a buffer to RSU. It also prepares a token
<FP.IP> for CTU and forwards it through a buffer queue. For split
phase transactions (for memory read/write) EXU sends a request
message to MP and continues. It implements the handshake signal with
OFU.

3.3.4 Result Store Unit (RSU) :

This is the only stage which can write into the shared OM. It writes
the result value in the destination register. It releases the MJOIN
line set by IFU in the case of an MJOIN instruction.

3.3.5 Continuation Token Unit (CTU) :

It forwards the new thread token <FP.IP> to the Sequencer.

Both RSU and CTU implement handshake signals with EXU.






Chapter 4. Twine RISC : Its Software Environment

4.1 Introduction and Overview

In this chapter we discuss the software support available in Twine
RISC and details of its implementation. In Section 4.2 we provide
the instruction set of Twine RISC. Twine RISC supports multiple
threads of execution. We discuss the support for creating and
synchronizing multiple threads in Section 4.3. In Section 4.4,
instructions that support memory references between Twine RISC and
outside shared global memory are discussed. Section 4.5 summarizes
the software environment of Twine RISC.

4.2 Instruction Set and its Coding :

The instruction set of a Twine RISC is intended to be a simple
extension of the RISC model. There are 19 different instructions in
total, each of which is capable of being executed in a single clock
cycle. Instructions are broadly classified into two classes, viz.
ordinary RISC like instructions and special instructions. Special
instructions support generation and synchronization of multiple
threads and handle external memory references as split phase
transactions. With these special instructions it is possible to
simulate the fine grained, asynchronous parallelism of dataflow
execution.

The instruction set of Twine RISC is coded in such a way that
instructions are recognized at the various units of the TRS pipeline
by decoding a minimum number of bits (see Appendix A). For example,
IFU decodes only one bit of the instruction opcode to determine
whether the next instruction fetch should be from code memory or
from the token queue.

The instructions available in Twine RISC are given in the
accompanying table.

4.3 Handling Multiple Threads

To implement concurrent threads, Twine RISC supports the
instructions MFORK and MJOIN. MFORK creates multiple threads while
MJOIN synchronizes them.

4.3.1 MFORK : Generation of multiple threads

The MFORK instruction is a method of spawning parallel threads of
computation from within an executing thread. By using this
instruction, up to a maximum of 5 threads are created. One of these
threads is the parent thread with continuation <FP.IP + 1>.
Addresses of the new threads to be generated are kept in a 32 bit
operand. Each address is relative to the address of the MFORK
instruction and is specified in 8 bits. Thus up to 4 addresses are
stored in a single 32 bit operand. Execution of this instruction
causes a continuation token <FP.IP + byte offset> to be generated
for each non zero value of offset. The number of threads thus
created is stored in a location which can later be used by MFORK's
dual instruction MJOIN for synchronizing execution.
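The decoding of the 32 bit offset operand can be sketched as follows. This is a hedged illustration: the function name, the order in which the four bytes are examined, the treatment of offsets as unsigned, and the inclusion of the parent thread in the stored count are all assumptions, since the text does not fix them.

```python
def mfork(fp, ip, offsets_word):
    """Return (tokens, thread_count) for an MFORK located at address ip."""
    tokens = [(fp, ip + 1)]                  # parent thread: <FP.IP + 1>
    for shift in (24, 16, 8, 0):             # byte order is an assumption
        offset = (offsets_word >> shift) & 0xFF
        if offset != 0:                      # a zero byte spawns no thread
            tokens.append((fp, ip + offset)) # <FP.IP + byte offset>
    return tokens, len(tokens)

# Offsets 0x0A and 0x14 are set, the other two bytes are zero.
tokens, count = mfork(fp=4, ip=100, offsets_word=0x0A001400)
print(tokens)   # [(4, 101), (4, 110), (4, 120)]
print(count)    # 3
```

With all four bytes non zero the function yields five tokens, matching the stated maximum of 5 threads.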

4.3.2 MJOIN : Synchronization of multiple threads

The MJOIN instruction allows multiple threads to synchronize. The
content of the OM location specified in the MJOIN instruction is
decremented by 1 for each execution of MJOIN. This location is set
by the MFORK instruction to the number of threads. As each thread
calls MJOIN exactly once, the only thread which finds this location
equal to zero after decrementing is the last thread executing MJOIN.
It is allowed to continue and all other threads die. The MJOIN
instruction generates the continuation token <FP.IP + 1> for the
last thread and thus execution is synchronized.
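The counting behaviour of MJOIN can be sketched in Python as below. The function name and the use of a dictionary as a stand-in for operand memory are illustrative assumptions; the sketch is sequential and therefore does not model the atomicity that the hardware has to guarantee.

```python
def mjoin(operand_memory, counter_addr, fp, ip):
    """Decrement the join counter; only the last thread gets a token."""
    operand_memory[counter_addr] -= 1
    if operand_memory[counter_addr] == 0:
        return (fp, ip + 1)     # last arrival continues at <FP.IP + 1>
    return None                 # every other thread dies here

om = {5: 3}                     # counter set by MFORK to 3 threads
results = [mjoin(om, 5, fp=2, ip=40) for _ in range(3)]
print(results)                  # [None, None, (2, 41)]
```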

As code is systematically compiled from dataflow graphs and the
processor is multithreaded, instructions from unrelated threads will
not compete for the same location of OM. There will always be an
adequate number of MJOINs to prevent races between normal
instructions. The exception is that there can still be a race
between two or more MJOIN instructions competing for the same
location. Each MJOIN instruction reads the OM location, tests it,
and writes it back; this must be atomic. To handle such a race
condition, MJOIN is executed in exclusion. This is done in the
following manner.

If the IF unit fetches an MJOIN instruction and the MJOIN lock is
set, then the instruction is not passed to OFU; otherwise IFU sets
the global MJOIN lock and passes the MJOIN instruction to the OFU.
The MJOIN lock thus prevents any other TRS from executing MJOIN as
its next instruction. As all units operate asynchronously with
handshake signals, other TRS pipelines are stalled whenever they
fetch an MJOIN instruction. However, the pending instructions in the
pipeline can still continue to execute. After MJOIN has been
executed atomically, the RS unit unsets the MJOIN lock line after
updating the OM location.

4.4 Data transfer to and from global memory

The Twine RISC model assumes a shared memory address space accessed
by all TRSs. This memory is conceived as an I-structure like global
memory. Any memory reference arising in a Twine RISC processor is
sent out as a split phase transaction to the I-structure memory
controller. To support this there are LOAD/STORE/RESM instructions
with hardware support from the message processor and data queue.
When a memory reference (LOAD/LOADX) is issued, the thread suspends;
upon completion of the operation a value is sent from memory, a RESM
instruction is executed and the thread reinitiates.

4.4.1 LOAD / LOADX : Data transfer from global memory to operand
memory

The LOAD instruction takes two parameters: the first parameter
specifies the OM location containing the global memory address,
whereas the second parameter specifies the OM location where the
data read is to be stored. In response to this instruction a read
request is sent to the memory controller of the I-structure global
memory through the MP and the thread is suspended. The IFU picks
up another thread from the token queue and continues execution.
The Read request format is shown below.

<Read gma, ct, dr> 


where 



gma - address of global memory location 
ct - continuation token 
dr - destination register 

When this Read request is satisfied by the I-structure memory
controller, it responds by reading the content of location gma,
say v, and sends a message
<Store v, ct, dr>
to the MP.

The MP, upon receiving the return message, inserts the values
v, ct, dr in the Data Queue and sends a continuation token <FP.0>
to the Sequencer. Address 0 in CM stores the RESM instruction.

Initially, at power-on time, OM is uninitialized and can be
initialized through the RS unit only. The LOADX instruction is
used to initialize an OM location.

Format of the instruction is LOADX a, x
where

a - address of location in global memory (limited to 6 bits
only)
x - destination register in OM

In all other respects LOADX is similar to the LOAD instruction.

4.4.2 RESM : Complete data transfer and Resume

This is an extension of the LOAD/LOADX instruction. The RESM
instruction is stored in location 0 of CM. Upon completion of the
memory read request a value is sent to the Data Queue and a
continuation token <FP.0> is inserted in the Token Queue. When a
token with thread address 0 is picked up by any of the TRSs, the
RESM instruction is executed. Upon execution of this instruction,
a tuple <v,ct,dr> is read from the Data Queue.

The value v is stored in register dr and thus data movement
from global memory to OM is completed. Also a new thread token
<ct> is inserted in TQ so that the thread continues exactly from
the location next to the LOAD/LOADX instruction.
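
The split phase LOAD and its completion by RESM can be traced with a small model. This is a sketch under stated assumptions, not the simulator itself: the function names (`issue_load`, `memory_reply`, `resm`), the dictionaries used for global memory and OM, and the immediate reply from memory are all simplifications introduced here.

```python
from collections import deque

RESM_ADDR = 0  # CM address of the RESM instruction, as stated in the text


def issue_load(om, fp, ip, addr_reg, dest_reg):
    """LOAD: build the <Read gma, ct, dr> request; the thread then suspends."""
    gma = om[(fp, addr_reg)]
    return ("Read", gma, (fp, ip + 1), dest_reg)


def memory_reply(gm, request, data_queue, token_queue):
    """Memory controller plus MP: queue <v, ct, dr> and the token <FP.0>."""
    _, gma, ct, dr = request
    data_queue.append((gm[gma], ct, dr))
    token_queue.append((ct[0], RESM_ADDR))


def resm(om, data_queue, token_queue):
    """RESM: move v into register dr and re-insert the saved continuation."""
    v, ct, dr = data_queue.popleft()
    om[(ct[0], dr)] = v
    token_queue.append(ct)


gm = {40: 99}                 # global memory: location 40 holds 99
om = {(0, 1): 40}             # register 1 of frame 0 holds the address
dq, tq = deque(), deque()

req = issue_load(om, fp=0, ip=7, addr_reg=1, dest_reg=2)
memory_reply(gm, req, dq, tq)
resume_token = tq.popleft()   # some TRS picks up <FP.0>
resm(om, dq, tq)
```

After the run, register 2 holds the loaded value and the Token Queue holds the continuation for the instruction following the LOAD.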

STORE / STOREX : Move data from operand memory to global
memory

A Write request is sent to the memory controller of the
I-structure global memory through the MP and the thread
continues. The Write request format is shown below.
<Write value, gma>
value - data to be stored
gma - address of global memory location

The memory controller, on receiving a Write message, stores value
in location gma. If location gma has a deferred list of pending
LOADs then it sends messages
<Store value, CT, A>
to the MP.

where

CT - corresponding continuation token
A - corresponding destination register in OM

Similar to LOADX, STOREX is used to store data in a fixed block
of the first 64 words in I-structure memory. It takes two
parameters. The first parameter specifies the 6 bit address of
the location in global memory and the second parameter specifies
the value to be stored there.
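
The deferred-read behaviour of the memory controller can be illustrated with a short model. This is a hedged sketch: the class name `IStructureMemory`, its method names and the rule that a read of an unwritten location is deferred are assumptions based on the usual I-structure convention; the text itself only states that a Write releases any deferred list of pending LOADs.

```python
class IStructureMemory:
    """Toy I-structure controller: reads of empty cells wait for a write."""

    def __init__(self):
        self.cells = {}       # gma -> value, for locations already written
        self.deferred = {}    # gma -> list of (ct, dr) waiting for a value

    def read(self, gma, ct, dr):
        """Handle <Read gma, ct, dr>; return the Store messages sent to MP."""
        if gma in self.cells:
            return [("Store", self.cells[gma], ct, dr)]
        self.deferred.setdefault(gma, []).append((ct, dr))
        return []

    def write(self, value, gma):
        """Handle <Write value, gma>; release every deferred LOAD on gma."""
        self.cells[gma] = value
        waiting = self.deferred.pop(gma, [])
        return [("Store", value, ct, dr) for ct, dr in waiting]


mem = IStructureMemory()
early = mem.read(5, ct=(0, 8), dr=2)       # location 5 not yet written
released = mem.write(42, 5)                # the write satisfies the reader
late = mem.read(5, ct=(0, 20), dr=3)       # later reads answer at once
```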

Instruction Set Summary :

Instructions ADD, SUB, AND, OR, XOR, SFTL, SFTR, STORE, STOREX do
not generate a new thread token. The execution continues from the
subsequent location, i.e. IP <-- IP + 1. In other words, the
thread continues.

MFORK generates up to 4 new thread tokens and the parent
thread also continues.

For jump like instructions, the next location cannot be
determined by the IFU till the execution is complete. Hence
JMP, JZ, JP, JPZ, JNZ generate a new thread token and the parent
thread dies, i.e. IP is set to a new thread fetched from TQ.

LOAD, LOADX generate a new thread token with thread address
0 and consume one.

RESM generates a new thread token and consumes one.

MJOIN may or may not generate a new token and consumes one.

Thus once TQ and CM are loaded, Twine RISC itself generates
and consumes threads and extracts parallelism (spatial) by
allowing more than one TRS to be active and executing different
threads. With very little compiler effort Twine RISC can execute
threads in parallel. (See Appendix C)



Chapter 5. Simulator and Performance Evaluation

5.1 Introduction and Overview

In this chapter, we discuss the simulator for Twine RISC.
The simulator provides a useful tool for the development of the
architecture and has been used to modify its original design.
Various parameters of the architecture depend on the currently
available VLSI technology, and these form the input to the
simulator.

The rest of the chapter is organized as follows. In section
5.2 we discuss the structure of the simulator. The input and
output interface of the simulator is also discussed in this
section. Metrics for performance and performance evaluation
aspects are discussed in section 5.3. In section 5.4, we discuss
the usefulness of this simulator and how it has been used
to enhance the design. Finally we conclude this chapter in
section 5.5.

5.2 Simulator Structure

Simulation is divided into two parts, viz. preparation of
input data and execution.

5.2.1 Input Preparation

A program written in a conventional language is first
converted to its equivalent dataflow graph. This dataflow graph is
then converted into a machine language program of Twine RISC. The
simulator requires 3 files, viz. GMF, CMF and TQF, as its input.
The GMF file contains the initial global memory image. The CMF
file contains the program code in binary. The TQF file contains
the initial token queue image. These files are prepared as
follows :

All immediate data are separated from instructions, as
instructions only refer to OM locations for operand values. These
initially available data, with other input data, are put into
file GMF (global memory file) at appropriate locations. These
values are moved to OM through LOADX instructions.

We provide Twine RISC instructions to the simulator. The
code in the current model of execution is fed manually. This can,
however, be done through a compiler at a later stage. The
instructions are fed to the simulator using mnemonics. This
mnemonic instruction code is converted to machine level binary
equivalent code by a code converter. Binary coded instructions
are stored in the code memory image file CMF. A compiler can
directly generate the CMF file to use the simulator.

By looking at the CMF, explicit threads are distinguished.
Such thread addresses are kept in TQF (token queue file).

Execution

The simulator program (Main()) asks the user to enter the GMF,
CMF and TQF file names. These can also be given as command line
arguments to the simulator. The program stores data from GMF into
the global I-structure memory simulated by it. It stores
instructions from CMF into Code Memory, and data (thread
addresses) from TQF are inserted into the Token Queue. After this
initialization is done, the simulator enters the function
Execute() with a Do-While loop.



Initially all TRSs are inactive. They are given priority
according to their tag number, i.e. TRS#1 has the highest
priority for fetching a new thread address from TQ. Otherwise,
all TRSs attempting to read TQ simultaneously would create a
problem.

At simulated global clock tick #T, TRS#1.IFU() reads TQ
and fetches an instruction from CM. It also sets IP to IP + 1 if
the thread is strictly sequential. It also sets its status bit to
the active state (1). When the TRS pipeline is empty this bit is
turned to the inactive state (0).

At clock tick #T+1, TRS#1.OFU() reads the buffer, fetches
operands from OM and writes a data packet into the buffer. At the
same instant, TRS#1.IFU() reads another instruction and writes a
data packet into the buffer. Thus the TRS#1 pipeline starts
filling.

At the same time, if TQ is not empty (a separate status bit
is provided), TRS#2.IFU() reads a thread address from TQ, fetches
an instruction from CM and starts executing a new thread. Thus
under favourable conditions, after clock tick #T+4, all four TRSs
are active, executing different threads of computation.

In case just the starting address is kept in TQ, TRS#1
continues execution of the thread and generates new continuation
tokens which fill TQ.

The Execute() loop is terminated when the following conditions
are satisfied.

1. TQ is empty.

2. All TRSs are inactive.
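
The priority rule and the two termination conditions can be sketched as a loop. This is a deliberately simplified model and not the simulator's Execute(): the function `execute`, the callback `run_thread` (which returns how many ticks a thread occupies its TRS) and the omission of newly generated tokens are all assumptions made for illustration.

```python
from collections import deque


def execute(token_queue, run_thread, n_trs=4):
    """Tick until TQ is empty and every TRS is inactive; return the ticks."""
    remaining = [None] * n_trs          # None marks an inactive TRS
    ticks = 0
    while token_queue or any(r is not None for r in remaining):
        # Inactive TRSs read TQ in tag order, so TRS#1 has highest priority.
        for i in range(n_trs):
            if remaining[i] is None and token_queue:
                remaining[i] = run_thread(token_queue.popleft())
        # Every active TRS advances by one simulated clock tick.
        for i in range(n_trs):
            if remaining[i] is not None:
                remaining[i] -= 1
                if remaining[i] <= 0:
                    remaining[i] = None
        ticks += 1
    return ticks


# Five threads of two ticks each on four TRSs.
total_ticks = execute(deque([1, 3, 5, 7, 9]), run_thread=lambda token: 2)
print(total_ticks)
```

Four threads run side by side and the fifth waits for a free TRS, so the run takes two rounds of two ticks.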

Then the program (Main()) asks the user to enter the global
memory addresses where output data are stored, and then displays
the computed results and stores them into the Result File.

Simulator organization and code structures are given
in Appendix B.

5.3 Performance Metrics

The simulator is written on the basis of the architectural
assumptions we made. Since it is difficult to time various units
precisely, we cannot compare the performance of the Twine RISC
architecture with other processors. Basically, the simulator is
targeted at checking the performance of the architecture in
exploiting available parallelism, as claimed previously.

Several program codes have been implemented in Twine
RISC's native graph representations and run on the prototype
simulator. The measured performance shows that Twine RISC can
indeed execute dataflow graphs with its software environment,
which requires very little compiler effort. Also, in the codes
tested on the Twine RISC simulator, the number of instructions
required was nearly equal to that for a conventional control flow
processor. The Twine RISC processor's ability to exploit
parallelism is evident when four diagnostic loops are run
together and the pipelines are kept full.

Sample programs and results are reported in Appendix C.

Some Design Issues

We started by writing the simulator on the architectural
and software support assumptions made in [12]. But it was quickly
found that the proposed architecture and its software environment
did not support each other, e.g. MFORK. That led us to make
changes both in the software constructs and in the architecture
design. Initialization of OM came into the picture when the
prototype simulator was ready, which led us to include the LOADX
instruction and also to maintain. To avoid a complex OM
implementation, inclusion of the Data Queue was deemed right.
Finally, Token Queue management and the explicit thread
generation and synchronization requirement led us to put the RESM
instruction at fixed location 0 in Code Memory and to make some
changes in the instruction format.

As simulator writing was in progress, the need for precise
specifications arose. That led us to develop a suitable
instruction set, with very careful coding, so as to have minimum
hardware at the various stages to decode instructions. Also, the
sizes of the various blocks, e.g. Operand Memory (Register File),
Data Queue, Message Processor, Buffers, Token Queue etc., are
considered with respect to available VLSI technology and
state-of-the-art memory design.

Summary

Simulator writing has provided us a lot of feedback in making
major changes in the architecture to make it foolproof.
Performance evaluation shows that Twine RISC is able to fulfill
its goals of executing dataflow graphs efficiently with an
economical architectural framework.



Chapter 6. Discussion, Conclusion and Future Work


Introduction and Overview

In this chapter, the basic philosophy that motivated our 
work is stated. The overall results of this work are summarized. 
To conclude, we suggest future research work needed to support 
our work. 

Philosophy

The Twine RISC processor design is targeted at enhancing
performance by exploiting both temporal and spatial parallelism.
The basic architectural framework was already there, with
essential software environment directives [12]. Before going for
its hardware implementation, simulation was needed to detect
design errors.

Summary

We have implemented the simulator for one Twine RISC
Stream on a SUN 3/60 under SUN OS using C. The simulator is based
on an event driven model and provides the time trace of a
simulator run.

The instructions for the Twine RISC simulator can be provided
in mnemonics, which are then converted to their binary equivalent
using a code converter developed through this work.

Synchronization in the pipeline is also observed through some
test programs written in Twine RISC assembly language and run on
the simulator. Through the simulator runs it is clear that the
architectural assumption that Twine RISC exploits the maximum
available parallelism holds.

Scope for Future Work

Simulator writing has provided us substantial feedback in
making major changes in the design. The simulator, however, does
not provide precise timing analysis. This can be incorporated for
performance evaluation of the architecture.

Currently the input to the simulator is provided through
hand coded mnemonics. A compiler interface can be developed for
this purpose.

As Twine RISC processors can be used to build a high
performance parallel computer, an exact protocol to do so needs
to be developed.

Appendix A : Instruction Set

This appendix consists of four sections. Section A.1
gives the instruction set of Twine RISC. Section A.2 describes
instruction flow in the TRS pipeline. Section A.3 gives the
coding of the instruction set. Finally, Section A.4 gives a short
summary of the instruction set.

A.1 Instruction Set :

The instruction set of a Twine RISC processor is intended
to be a simple extension of the RISC model. There are in total 19
different instructions, each of which is capable of being
executed in a single clock cycle (except the MFORK instruction).
Instructions are classified in four major categories, viz.
Arithmetic & Logic, Branch, Memory references, and Generation and
synchronization of multiple threads.

A.2 Instruction Execution in TRS pipeline :

A.2.1 Ordinary RISC like instructions :

a. ADD, SUB, AND, OR, XOR :

All these instructions fetch two operands from OM and store
one result back to OM.

Syntax of these instructions is

opcode r1 r2 r3

r1 - left operand source register
r2 - right operand source register
r3 - destination register

e.g. consider opcode ADD

OFU fetches two operands [FP.r1] and [FP.r2]. EXU adds them as
[FP.r1] + [FP.r2] --> value and passes <value, r3> to RSU. RSU
stores value in [FP.r3]. Execution continues from the next
location (i.e. IP + 1).

b. SFTL, SFTR :

As we are operating on 32 bit operands we can shift by at
most 32 bits. The shift count is stored in the instruction in 6
bits only. Thus we need to fetch only one operand from OM and
store the result back to OM.

Syntax of the instructions is

opcode a r2 r3

a - value specifying how many bits are to be shifted (stored in
instruction)

r2 - operand source register

r3 - destination register

e.g. consider SFTL a, r2, r3 instruction.

OFU fetches one operand [FP.r2]. EXU operates and computes the
result value as [FP.r2] << a --> value and passes <value, r3> to
RSU. RSU stores value in [FP.r3]. Execution continues from
IP + 1.


c. JMP

This instruction supports direct jump up to 18 bit range. As
the jump address is directly specified in 18 bits in the
instruction, there is no operand fetch from OM.

Syntax of this instruction is
JMP x

x - 18 bit value specifying jump address

OFU does not fetch anything from OM. x is treated as one
operand. EXU generates continuation token <FP.x> and passes it to
CTU. CTU forwards this continuation token to Sequencer which then
inserts it into TQ. No result is written to OM.

d. JZ, JP, JPZ, JNZ

These instructions support conditional jump up to 12 bit
offset range. As the jump offset is directly specified in 12 bits
in the instruction, we need to fetch only one operand (in which
the condition value is stored) from OM.

Syntax of these instructions is
JCOND r1 x

r1 - condition operand source register
x - 12 bit value specifying jump offset

OFU fetches one operand [FP.r1]; x is treated as another
operand. EXU tests the condition on [FP.r1] and, if the condition
is true, EXU generates continuation <FP.IP + x>, else <FP.IP + 1>
is generated. EXU passes this continuation token to CTU. CTU
forwards this continuation token to Sequencer which then inserts
it into TQ. No result is written to OM.
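
How EXU picks the continuation for a conditional jump can be shown in a few lines. The sketch is illustrative: the function `jcond` is a hypothetical name, and the meanings given to the four mnemonics (zero, non-zero, positive, positive or zero) are an assumption, since the text does not spell them out.

```python
# Assumed meanings of the mnemonics; the text does not define them.
CONDITIONS = {
    "JZ": lambda v: v == 0,
    "JNZ": lambda v: v != 0,
    "JP": lambda v: v > 0,
    "JPZ": lambda v: v >= 0,
}


def jcond(opcode, om, fp, ip, r1, x):
    """Return the continuation token produced by a conditional jump."""
    taken = CONDITIONS[opcode](om[(fp, r1)])
    return (fp, ip + x) if taken else (fp, ip + 1)


om = {(0, 3): 0}                                        # condition register
taken_token = jcond("JZ", om, fp=0, ip=10, r1=3, x=5)   # condition holds
fall_token = jcond("JNZ", om, fp=0, ip=10, r1=3, x=5)   # condition fails
```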

A.2.2 Special Instructions :

These instructions are extensions of the RISC model.

a. MFORK

The MFORK instruction is a method of spawning parallel threads
of computation from within an executing thread.

New thread offsets are organized as n1, n2, n3, n4, each 8 bits,
which are then grouped in a 32 bit number stored in OM. Thus only
one fetch from OM is required.

Syntax of this instruction is

MFORK r2, r3

r2 - operand source register which contains a grouped number
from which new thread offsets are derived.

r3 - destination register

OFU fetches one operand [FP.r2] from OM. EXU interprets
[FP.r2] as four blocks of 8 bits each.

For each byte, if the byte value is nonzero then the value is
considered as an offset. EXU prepares a continuation token
<FP.IP + offset value> and passes it to CTU. CTU forwards these
continuation tokens to Sequencer.

One continuation <FP.IP + 1> is always there.
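
The per-byte rule just described can be sketched as a decoder. This is an illustrative model only: the function name `mfork` and the token representation `(fp, ip)` are assumptions, and the sketch follows the "nonzero byte is an offset" rule literally.

```python
def mfork(word, fp, ip):
    """Split a 32 bit word into four 8 bit offsets and build the tokens.

    Byte #1 is taken to be the most significant byte. Every nonzero
    byte yields <FP.IP + offset>; <FP.IP + 1> is always appended for
    the parent thread.
    """
    offsets = [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    tokens = [(fp, ip + off) for off in offsets if off != 0]
    tokens.append((fp, ip + 1))
    return tokens


# 2309 = 9 * 256 + 5, i.e. bytes (0, 0, 9, 5), forked from location 13.
tokens = mfork(2309, fp=0, ip=13)
print(tokens)
```

With the word 2309 at location 13, the offsets 9 and 5 give threads starting at locations 22 and 18, while the parent continues at location 14.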

Number of new threads generated is derived as follows :

        Bytes                 No. of new threads
 #1    #2    #3    #4             (value)

 NZ    X     X     X                 5
 Z     NZ    X     X                 4
 Z     Z     NZ    X                 3
 Z     Z     Z     NZ                2

Here

Byte #1 is the most significant byte
NZ - nonzero value
Z - zero
X - don't care

EXU passes <value, r3> to RSU.

RSU stores value in [FP.r3].

b. MJOIN

The MJOIN instruction allows multiple threads to synchronize
execution. Only one fetch from OM is required. This instruction
decrements the content of the specified location by one and
writes the result back to the same location.

Syntax of this instruction is

MJOIN r2, r2

r2 - operand source register, contains number of threads to be
synchronized

r2 - destination register

OFU fetches one operand [FP.r2] from OM. EXU decrements
[FP.r2] by 1 and tests it. If the result value is zero then
continuation <FP.IP + 1> is passed to CTU, else the thread dies.
EXU also passes <result value, r2> to RSU. RSU stores value in
register [FP.r2].

If continuation token <FP.IP + 1> is generated then it is
forwarded to Sequencer by CTU.

c. LOAD, LOADX

These instructions are used to send a request to move data
from global I-structure memory to OM.

Syntax of these instructions is

1. LOAD a x

a - operand source register, contains address of global memory
location

x - destination register

OFU fetches one operand [FP.a] from OM. EXU sends a read
request to MP, which then forwards it to the memory controller of
I-structure memory, and the thread dies. The request format is
<Read [FP.a], FP.IP + 1, x>

[FP.a] - address of global memory location
FP.IP + 1 - continuation token
x - destination register in which data is to be moved

2. LOADX a x

a - 6 bit value specifying address of global memory location in
instruction itself

x - destination register

OFU does not fetch any operand from OM; a is treated as an
operand. EXU sends a read request to MP, which then forwards it
to the memory controller of I-structure memory, and the thread
dies. The request format is

<Read a, FP.IP + 1, x>
a - address of global memory location
FP.IP + 1 - continuation token
x - destination register in which data is to be moved

When this read message is processed by the I-structure memory
controller, it responds by reading the contents of the location
and sends a message to MP. The message format is
<Store v, FP.IP + 1, x>

v - data value

Upon receiving a return message from the memory controller, the
MP inserts the values v, FP.IP + 1, x in the Data Queue and sends
a continuation token <FP.0> to Sequencer. Here continuation token
<FP.0> corresponds to the instruction RESM in CM.



d. RESM

When a memory operation LOAD/LOADX is issued, upon completion
of the operation a value is sent to the Data Queue, and a RESM
instruction is executed to move data from the Data Queue to OM
and the thread reinitiates. This is the only instruction that
fetches operands from the Data Queue.

Syntax of this instruction is
RESM

OFU fetches data <v, FP.IP + 1, x> from the Data Queue instead
of OM. v and FP.IP + 1 become two operands and x the destination
register.

EXU passes <v,x> to RSU.

EXU also prepares a continuation token <FP.IP + 1> which is
then forwarded to CTU.

RSU stores v in register [FP.x] and finally data is moved into
OM.

CTU sends <FP.IP + 1> to Sequencer which then inserts it into
TQ.

Here <FP.IP + 1> is not <FP.1>, although RESM is at <FP.0>; this
continuation token is read from the Data Queue.

e. STORE, STOREX

These instructions are used to send a request to move data
from OM to global I-structure memory.

Syntax of these instructions is
1. STORE x, a

x - operand source register, contains address of global memory
location

a - operand source register from which data is to be moved to
global memory

OFU fetches two operands [FP.x] and [FP.a] from OM. EXU sends
a message to MP, which is then forwarded to the memory controller
of I-structure global memory, and the thread continues. The
message format is

<Write [FP.a], [FP.x]>

[FP.x] - address of global memory location

[FP.a] - value to be written

The memory controller, upon receiving the Write message, stores
value [FP.a] in location [FP.x]. If location [FP.x] has a
deferred list of pending LOADs then the memory controller sends
store messages to MP. The message format is
<Store v, CT, A>

v - data value

CT - corresponding continuation token

A - corresponding destination register

2. STOREX x, a

x - 6 bit value specifying address of global memory location in
instruction itself

a - operand source register from which data is to be moved to
global memory

OFU fetches one operand [FP.a] from OM; x is treated as the
other operand.

EXU sends a message to MP, which is then forwarded to the memory
controller of I-structure global memory, and the thread
continues. The message format is
<Write [FP.a], x>

x - address of global memory location

[FP.a] - value to be written

The memory controller, upon receiving the Write message, stores
value [FP.a] in location x. If location x has a deferred list of
pending LOADs then the memory controller sends store messages to
MP. The message format is
<Store v, CT, A>

v - data value

CT - corresponding continuation token

Execution continues from IP + 1.

A.3 Instruction Set Coding

The instruction set of Twine RISC is coded in such a way
that instructions are recognized by decoding a minimum number of
bits at the various units of the pipeline. There are 19
instructions in the instruction set, spread sparsely over 6 bits
of opcode.

IFU : IFU decodes the first bit of the opcode and decides
whether the next instruction comes from the subsequent
location, i.e. IP + 1, or not. In the latter case, a new thread
address is taken from the Token Queue for execution. If the first
bit is 0 then IP is incremented to IP + 1, else a new thread
address is fetched from TQ.

Also we need to identify the MJOIN instruction at this stage.
This is needed for setting the global MJOIN lock and thus
providing MJOIN execution in exclusion. Opcode for MJOIN is
111111 (all 1's), which can be decoded with minimum hardware
support.

OFU : Here we need to recognize whether OFU has to fetch two
operands, one operand or no operand from OM. It also identifies
the RESM instruction, for which the operands are fetched from the
Data Queue. This can be done by decoding the last two bits of the
opcode as

00 - two fetches from OM

01 - fetch from DQ

10 - no fetch from OM

11 - one fetch from OM

In some instructions, the second operand is specified in the
instruction itself. This is implemented by decoding yet another
bit.

a. last 12 bits are treated as second operand or not, i.e. for
conditional jumps and other instructions in the one fetch
category like SFTL, SFTR, MFORK etc.

b. last 18 bits are treated as an operand or not, i.e. for JMP
and LOADX in the no fetch category.

EXU : By looking at the first 5 bits all instructions can be
decoded.
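
The stage-by-stage decoding can be sketched with bit masks. This is a minimal model under stated assumptions: the function names are hypothetical, and "first bit" is taken to mean the most significant bit of the 6 bit opcode, which the text does not state explicitly. Only the MJOIN opcode (111111) is taken from the text; the other opcode used below is an arbitrary value chosen to exercise the masks.

```python
MJOIN_OPCODE = 0b111111       # all ones, as stated in the text

FETCH_MODE = {
    0b00: "two fetches from OM",
    0b01: "fetch from DQ",
    0b10: "no fetch from OM",
    0b11: "one fetch from OM",
}


def ifu_is_sequential(opcode):
    """IFU: a leading 0 bit means the next instruction is at IP + 1."""
    return (opcode >> 5) & 1 == 0


def ifu_is_mjoin(opcode):
    """IFU: MJOIN is recognized early so that the global lock can be set."""
    return opcode == MJOIN_OPCODE


def ofu_fetch_mode(opcode):
    """OFU: the last two opcode bits select the operand fetch behaviour."""
    return FETCH_MODE[opcode & 0b11]


mjoin_mode = ofu_fetch_mode(MJOIN_OPCODE)     # MJOIN needs one OM fetch
example_mode = ofu_fetch_mode(0b000100)       # arbitrary opcode ending 00
```

The MJOIN opcode ends in 11, which agrees with the statement that MJOIN needs only one fetch from OM.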

Coding of instruction set is given in table.^. Fi'g . A*3 -A 

A* S 

Summary of instruction set is given in table:#, rig. 



[Table: coding of the instruction set. Only fragments survive:
the "11 (1 fetch)" group lists SFTL, SFTR, MFORK, STOREX, JNZ,
LOAD and MJOIN.]

Appendix B : Code Structure for Simulator

This appendix gives the basic code structure for the Simulator.

B.1 Structure of Simulator

main()
{
    input();
    initialization();
    do {
        execution();
    } while (condition#1);
    output();
}

input()
{
    read GMF, CMF, TQF;
}

initialization()
{
    load global memory;
    load code memory;
    load token queue;
    set status of TRSs = 0;
    set status of all buffers = 0;
}

execution()
{
    TRS#1(), TRS#2(), TRS#3(), TRS#4()
}

output()
{
    ask for gm addresses;
    display results;
    store results to RF;
}

TRS#1()
{
    read TQ;
    set TRS#1.status = 1;
    do {
        IFU();
        OFU();
        EXU();
        RSU(), CTU(), SEQ();
    } while (condition#2);
    set TRS#1.status = 0;
}

IFU()
{
    fetch instruction from CM;
    decode opcode;
    set IP;
    forward data to buffer#1;
}

OFU()
{
    read buffer#1;
    set buffer#1.status = 0;
    decode opcode;
    fetch operands;
    forward data to buffer#2;
}

EXU()
{
    read buffer#2;
    set buffer#2.status = 0;
    decode opcode;
    operate on data;
    forward result to buffer#3,
    forward continuation to buffer#4
}

RSU()
{
    read buffer#3;
    set buffer#3.status = 0;
    send handshake to EXU;
    store data in OM;
}

CTU()
{
    read buffer#4;
    set buffer#4.status = 0;
    forward continuation to SEQ;
}

SEQ()
{
    insert continuations into TQ
}



Appendix C : User's Manual and Test Programs

This appendix consists of two sections. Section C.1
gives directives to run the Twine RISC simulator. Section C.2
shows how Twine RISC extracts parallelism from programs. Two
simple example programs are considered. For simplicity, a
multiplication instruction is included in the instruction set.

C.1 User's Manual

The simulator is written using 3 files.

- [func.c] contains all functions used in the main file.

- [a.c] is the main file; it calls the functions.

- [global.h] contains global variables and data structures
used in the simulator.

The executable file [sim] is to be generated using the Makefile
specifically written for this simulator.

To run the simulator give the command

sim gmf cmf tqf rsf

Before running the simulator, the files gmf, cmf, tqf and rsf
are to be prepared.

The [gmf] file contains the initial global memory image. It
contains global memory locations and data values.

The [cmf] file contains the code memory image. It contains CM
locations and instructions. To prepare this file, one first has
to prepare a temp file consisting of locations and mnemonic
instruction code. A code converter converts this temp file to the
machine level binary equivalent file. For that, [convert] is
written.

The [tqf] file contains explicit thread addresses known
beforehand.

The [rsf] file contains global memory locations where
result values are stored after computation.

The simulator generates an output file [rout], which contains
the global memory locations given in [rsf] and the values
computed in the program.

All these image files are given for test program 1 (see
the figure "Files for program 1").

C.2 Test Programs

Program 1.

Compute X, Y, Z.

X = [A*B] + [C*D];
Y = [A*D] - [B*C];
Z = [D*B] - [C+A];

Here it is evident that once A, B, C and D are available, X, Y
and Z can be computed concurrently. Also A, B, C, D can be
moved to operand memory registers by dispatching concurrent
LOADs. Parallel control flow for LOADs is done with the help of
the MJOIN instruction, which allows execution to continue only
after A, B, C and D are moved into OM. As these LOADs are
explicitly known, their addresses are inserted in TQ before
execution. X, Y and Z are computed simultaneously with the help
of the MFORK instruction, which generates parallel threads of
computation for X, Y and Z. X, Y and Z can then be synchronized
for further computations. As this program does not have any
iterations, the frame pointer FP is the same (i.e. = 0) for all
instructions.



[Figure: Sequential and parallel control flows for LOADs.]


















location  instruction

 0   resm            /*resume instruction at location 0*/
 1   loadx 1 1       /*load mjoin register 1 (for synchronising
                       loadxs)*/
 2   jmp 12          /*jump to synch 1*/
 3   loadx 2 2       /*move A to OM register 2*/
 4   jmp 12          /*jump to synch 1*/
 5   loadx 3 3       /*move B to register 3*/
 6   jmp 12          /*jump to synch 1*/
 7   loadx 4 4       /*move C to register 4*/
 8   jmp 12          /*jump to synch 1*/
 9   loadx 5 5       /*move D to register 5*/
10   jmp 12          /*jump to synch 1*/
11   loadx 6 6       /*load mfork register 6 (for generating new
                       threads)*/
12   mjoin 1         /*synch 1*/
13   mfork 6 7       /*new threads generated*/
14   mul 2 3 10      /*thread #1 parent*/
15   mul 4 5 11
16   add 10 11 12
17   jmp 25          /*jump to synch 2*/
18   mul 5 2 13      /*thread #2*/
19   mul 3 4 14
20   sub 13 14 15
21   jmp 25          /*jump to synch 2*/
22   mul 5 3 16      /*thread #3*/
23   add 4 2 17
24   sub 16 17 18
25   mjoin 7         /*synch 2*/
26   storex 12 12    /*store X in global memory location#12*/
27   storex 15 15    /*store Y in global memory location#15*/
28   storex 18 18    /*store Z in global memory location#18*/

cmf : code memory image file (mnemonic code) for prog.1


location    data value

 1          5           /*mjoin*/
 2          10          /*A*/
 3          5           /*B*/
 4          6           /*C*/
 5          2           /*D*/
 6          2309        /*mfork ...*/

gmf : global memory image file


FP.IP

0.1
0.3
0.5
0.7
0.9
0.11

B. tqf : initial token queue image


location

12
15
18

C. rsf : result locations file


location    data value

12          62          /*X=(A*B)+(C*D)*/
15          -10         /*Y=(A*D)-(B*C)*/
18          -6          /*Z=(D*B)-(C+A)*/

D. rout : result output file


Files for program 1
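
The listed results can be cross-checked against the three thread bodies of the mnemonic program. The sketch below is only an arithmetic check, not a run of the simulator: the function `program1` is a hypothetical helper, and the input values A = 10, B = 5, C = 6, D = 2 are assumptions chosen to agree with the results 62, -10 and -6 in the result output file.

```python
def program1(a, b, c, d):
    """Evaluate the three thread bodies of test program 1.

    The returned keys are the global memory locations written by the
    STOREX instructions at code locations 26 to 28.
    """
    x = (a * b) + (c * d)     # thread #1: mul, mul, add
    y = (d * a) - (b * c)     # thread #2: mul, mul, sub
    z = (d * b) - (c + a)     # thread #3: mul, add, sub
    return {12: x, 15: y, 18: z}


rout = program1(a=10, b=5, c=6, d=2)
print(rout)
```

The three bodies share no destination registers, which is why the three threads created by MFORK can run independently until the second MJOIN.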



Program 2.


Compute vector inner product.

for i=1 to n
S = S + A[i]*B[i];

Here parallelism is extracted in two ways.

1. A[i] and B[i] are loaded simultaneously.

2. For different i, the products A[i]*B[i] are computed
concurrently and then added.

To compute all iteration codes in parallel, we have to
provide a different frame pointer FP for each i, so that with the
same code block in instruction memory, execution is performed on
different operand register sets. (see fig.)

Although the programs appear "apparently" sequential, much
parallelism is exploited by the Twine RISC architecture: all
index calculations, loads and multiplications can be done in
parallel.
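
The role of the frame pointer in Program 2 can be illustrated as follows. This is a sketch under stated assumptions, not Twine RISC code: each iteration is given its own register set (a dictionary keyed by a hypothetical frame number), the same `iteration_body` function plays the part of the shared code block, and the iterations are evaluated one after another even though nothing in the body makes them depend on each other.

```python
def iteration_body(frame):
    """The shared code block: it touches only its own frame's registers."""
    frame["prod"] = frame["a"] * frame["b"]
    return frame["prod"]


def inner_product(a, b):
    """Give every iteration i its own frame, then add the products."""
    frames = {
        fp: {"a": x, "b": y}
        for fp, (x, y) in enumerate(zip(a, b), start=1)
    }
    # The bodies are independent, so they could execute in any order.
    return sum(iteration_body(frame) for frame in frames.values())


s = inner_product([1, 2, 3], [4, 5, 6])
print(s)
```

Because no two frames share a register, the same code block can serve every value of i without interference, which is the point of using a separate FP per iteration.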
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